Sunah Park

This markdown file is created by Sunah Park for extended lab exercises in the book An Introduction to Statistical Learning with Applications in R (ISLR).


Setup for code chunks

rm(list=ls())
# default r markdown global options in document
knitr::opts_chunk$set(
   ########## Text results ##########
    echo=TRUE, 
    warning=FALSE, # to preserve warnings in the output 
    error=FALSE, # to preserve errors in the output
    message=FALSE, # to preserve messages
    strip.white=TRUE, # to remove the white lines in the beginning or end of a source chunk in the output 

    ########## Cache ##########
    cache=TRUE,
   
    ########## Plots ##########
    fig.path="", # prefix to be used for figure filenames
    fig.width=8,
    fig.height=6,
    dpi=200
)

library(ISLR)
library(splines) # bs() function
library(ggplot2)
library(gam) # gam() function
library(akima) # two-dimensional surface
head(Wage,3)
##        year age           maritl     race       education
## 231655 2006  18 1. Never Married 1. White    1. < HS Grad
## 86582  2004  24 1. Never Married 1. White 4. College Grad
## 161300 2003  45       2. Married 1. White 3. Some College
##                    region       jobclass         health health_ins
## 231655 2. Middle Atlantic  1. Industrial      1. <=Good      2. No
## 86582  2. Middle Atlantic 2. Information 2. >=Very Good      2. No
## 161300 2. Middle Atlantic  1. Industrial      1. <=Good     1. Yes
##         logwage      wage
## 231655 4.318063  75.04315
## 86582  4.255273  70.47602
## 161300 4.875061 130.98218
Wage<-na.omit(Wage)

Linear models are relatively simple to describe and implement, and have advantages over other approaches in terms of interpretation and inference. However, standard linear regression can have significant limitations in terms of predictive power. Some simple extensions of linear models are examined here below.

1. Polynomial Regression

Polynomial regression extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. This approach provides a simple way to provide a non-linear fit to a data.

fit<-lm(wage~poly(age,4), data=Wage)
coef(summary(fit))
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)    111.70361  0.7287409 153.283015 0.000000e+00
## poly(age, 4)1  447.06785 39.9147851  11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3  125.52169 39.9147851   3.144742 1.678622e-03
## poly(age, 4)4  -77.91118 39.9147851  -1.951938 5.103865e-02
fit2<-lm(wage~poly(age,4, raw=T), data=Wage)
coef(summary(fit2))
##                             Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)            -1.841542e+02 6.004038e+01 -3.067172 0.0021802539
## poly(age, 4, raw = T)1  2.124552e+01 5.886748e+00  3.609042 0.0003123618
## poly(age, 4, raw = T)2 -5.638593e-01 2.061083e-01 -2.735743 0.0062606446
## poly(age, 4, raw = T)3  6.810688e-03 3.065931e-03  2.221409 0.0263977518
## poly(age, 4, raw = T)4 -3.203830e-05 1.641359e-05 -1.951938 0.0510386498
fit3<-lm(wage~age+I(age^2)+I(age^3)+I(age^4), data=Wage)
coef(summary(fit3))
##                  Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) -1.841542e+02 6.004038e+01 -3.067172 0.0021802539
## age          2.124552e+01 5.886748e+00  3.609042 0.0003123618
## I(age^2)    -5.638593e-01 2.061083e-01 -2.735743 0.0062606446
## I(age^3)     6.810688e-03 3.065931e-03  2.221409 0.0263977518
## I(age^4)    -3.203830e-05 1.641359e-05 -1.951938 0.0510386498
fit4<-lm(wage~cbind(age, age^2, age^3, age^4), data=Wage)
coef(summary(fit4))
##                                         Estimate   Std. Error   t value
## (Intercept)                        -1.841542e+02 6.004038e+01 -3.067172
## cbind(age, age^2, age^3, age^4)age  2.124552e+01 5.886748e+00  3.609042
## cbind(age, age^2, age^3, age^4)    -5.638593e-01 2.061083e-01 -2.735743
## cbind(age, age^2, age^3, age^4)     6.810688e-03 3.065931e-03  2.221409
## cbind(age, age^2, age^3, age^4)    -3.203830e-05 1.641359e-05 -1.951938
##                                        Pr(>|t|)
## (Intercept)                        0.0021802539
## cbind(age, age^2, age^3, age^4)age 0.0003123618
## cbind(age, age^2, age^3, age^4)    0.0062606446
## cbind(age, age^2, age^3, age^4)    0.0263977518
## cbind(age, age^2, age^3, age^4)    0.0510386498
agelims<-range(Wage$age)
age.grid<-seq(from=agelims[1], to=agelims[2])
preds<-predict(fit, newdata=list(age=age.grid), se=TRUE)
se.bands<-cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit)

preds2<-predict(fit2, newdata=list(age=age.grid), se=TRUE)
max(abs(preds$fit-preds2$fit))
## [1] 7.81597e-11
plot(Wage$age, Wage$wage, xlim=agelims,col="grey",pch=19)
lines(age.grid, preds$fit, col="red")
matlines(age.grid, se.bands, col="blue", lty=2)

Decision on the degree of polynomial

fit.1<-lm(wage~age, data=Wage)
fit.2<-lm(wage~poly(age,2), data=Wage)
fit.3<-lm(wage~poly(age,3), data=Wage)
fit.4<-lm(wage~poly(age,4), data=Wage)
fit.5<-lm(wage~poly(age,5), data=Wage)

anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We perform the ANOVA(Analysis of Variance) function in order to test the null hypothesis that a model M1 is sufficient to explain the data against the alternative hypotehsis that a more complex model M2 is required. In order to use the anova() function, M1 and M2 must be nested models: the predictors in M1 must be a subset of the predictors in M2.

The p-value comparing the linear Model 1 to the quadratic Model 2 is essentially zero, indicating that a linear fit is not sufficient. Similarly the p-value comparing the quadratic Model 2 to the cubic Model 3 is very low (0.0017), so the quadratic fit is also insufficient. The p-value comparing the cubic and degree-4 polynomials, Model 3 and Model 4, is approximately 5% while the degree-5 polynomial Model 5 seems unnecessary because its p-value is 0.37. Hence, either a cubic or a quartic polynomial appear to provide a reasonable fit to the data, but lower- or higher-order models are not justified.

In this case, instead of using the anova() function, we could have obtained these p-values more succinctly by exploiting the fact that poly() creates orthogonal polynomials. Note that the p-values are the same, and in fact the square of the t-statistics are squal to the F-statistics from the anova() function.

coef(summary(fit.5))
##                 Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)    111.70361  0.7287647 153.2780243 0.000000e+00
## poly(age, 5)1  447.06785 39.9160847  11.2001930 1.491111e-28
## poly(age, 5)2 -478.31581 39.9160847 -11.9830341 2.367734e-32
## poly(age, 5)3  125.52169 39.9160847   3.1446392 1.679213e-03
## poly(age, 5)4  -77.91118 39.9160847  -1.9518743 5.104623e-02
## poly(age, 5)5  -35.81289 39.9160847  -0.8972045 3.696820e-01

However, the ANOVA method works whether or not we used orthogonal polynomials; it also works when we have other terms in the model as well.

fit.1<-lm(wage~education+age, data=Wage)
fit.2<-lm(wage~education+poly(age,2), data=Wage)
fit.3<-lm(wage~education+poly(age,3), data=Wage)
anova(fit.1, fit.2, fit.3)
## Analysis of Variance Table
## 
## Model 1: wage ~ education + age
## Model 2: wage ~ education + poly(age, 2)
## Model 3: wage ~ education + poly(age, 3)
##   Res.Df     RSS Df Sum of Sq        F Pr(>F)    
## 1   2994 3867992                                 
## 2   2993 3725395  1    142597 114.6969 <2e-16 ***
## 3   2992 3719809  1      5587   4.4936 0.0341 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As an alternative to using hypothesis tests and ANOVA, we could choose the polynomial degree using cross-validation(Chapter 5).

Next we consider the task of predicting whether an individual earns more than $250,000 per year. I() was used to create the binary response variable.

fit<-glm(I(wage>250)~poly(age,4),data=Wage, family=binomial) # polynomial logistic regression
preds<-predict(fit, newdata=list(age=age.grid), se=TRUE)

pfit<-exp(preds$fit)/(1+exp(preds$fit))
se.bands.logit<-cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit)
se.bands<-exp(se.bands.logit)/(1+exp(se.bands.logit))
preds<-predict(fit, newdata=list(age=age.grid), type="response", se=T)

plot(Wage$age, I(Wage$wage>250), xlim=agelims, type="n", ylim=c(0,0.2))
points(Wage$age, I((Wage$wage>250)/5), cex=.5, col="grey",pch=19)
lines(age.grid, pfit, col="red")
matlines(age.grid, se.bands, col="blue",lty=2)

We have drawn the age values corresponding to the observations with wage values above 250 as gray marks on the top of the plot, and those with wage values below 250 are shown as gray marks on the bottom of the plot.


2. Step function

Step functions cut the range of a variable into K distinct regions in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.

table(cut(Wage$age,5)) # cut() for step function
## 
## (17.9,30.4] (30.4,42.8] (42.8,55.2] (55.2,67.6] (67.6,80.1] 
##         522         987        1081         366          44
fit<-lm(wage~cut(age,5), data=Wage)
preds<-predict(fit, newdata=list(age=age.grid), se=TRUE)
se.bands<-cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit)

coef(summary(fit))
##                        Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)            89.53707   1.766687 50.680770 0.000000e+00
## cut(age, 5)(30.4,42.8] 24.36971   2.184468 11.155902 2.406337e-28
## cut(age, 5)(42.8,55.2] 29.06134   2.151363 13.508342 2.079569e-40
## cut(age, 5)(55.2,67.6] 29.36243   2.751855 10.670047 4.113600e-26
## cut(age, 5)(67.6,80.1]  6.47140   6.336385  1.021308 3.071911e-01
plot(Wage$age, Wage$wage, xlim=agelims,pch=19, col="grey")
lines(age.grid, preds$fit, col="red")
matlines(age.grid, se.bands, col="blue",lty=2)

cut() function autimatically picks the cutpoints at 30.4, 42.8,55.2 and 67.6. The function cut() returns an ordered categorical variable; the lm() function then creates a set of dummy variables for use in the regression.


3. Regression Splines

Regression splines are more flexible than polynomials and step functions, and in fact are an extension of the two. They involve dividing the range of X into K distinct regions. Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or knots. Provided that the interval is divided into enough regions, this can produce an extremely flexible fit.

fit<-lm(wage~bs(age, knots=c(25,40,60)), data=Wage) # bs: Generate the B-spline basis matrix for a polynomial spline
summary(fit)
## 
## Call:
## lm(formula = wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -98.832 -24.537  -5.049  15.209 203.207 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       60.494      9.460   6.394 1.86e-10 ***
## bs(age, knots = c(25, 40, 60))1    3.980     12.538   0.317 0.750899    
## bs(age, knots = c(25, 40, 60))2   44.631      9.626   4.636 3.70e-06 ***
## bs(age, knots = c(25, 40, 60))3   62.839     10.755   5.843 5.69e-09 ***
## bs(age, knots = c(25, 40, 60))4   55.991     10.706   5.230 1.81e-07 ***
## bs(age, knots = c(25, 40, 60))5   50.688     14.402   3.520 0.000439 ***
## bs(age, knots = c(25, 40, 60))6   16.606     19.126   0.868 0.385338    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.92 on 2993 degrees of freedom
## Multiple R-squared:  0.08642,    Adjusted R-squared:  0.08459 
## F-statistic: 47.19 on 6 and 2993 DF,  p-value: < 2.2e-16
pred<-predict(fit, newdata=list(age=age.grid), se=TRUE)
se.bands<-cbind(pred$fit+2*pred$se.fit, pred$fit-2*pred$se.fit)
plot(Wage$age, Wage$wage, col="grey",pch=19)
lines(age.grid, pred$fit, col="red")
matlines(age.grid, se.bands, col="blue", lty=2)

Here we have prespecified knots at ages 25, 40 and 60. This produces a spline with six basis functions (A cubic spline with three knots has seven degrees of freedom; these degress of freedom are used up by an intercept, plus six basis functions). We could also use the df option to produce a spline with knots at uniform quantiles of the data.

dim(bs(Wage$age, knots=c(25,40,60))) # bs(): generate the B-spline basis matrix for a polynomial spline
## [1] 3000    6
dim(bs(Wage$age, df=6))
## [1] 3000    6
attr(bs(Wage$age,df=6), "knots")
##   25%   50%   75% 
## 33.75 42.00 51.00

At df=6, R chooses knots at ages 33.8, 42.0 and 51.0, which correspond to the 25th, 50th and 75th percentiles of age. The function bs() also has a degree argument, so we can fit splines of any degree, rather than the default degree of 3.

In order to instead fit a natural spline, we use the ns() function.

fit2<-lm(wage~ns(age,df=4), data=Wage)
pred2<-predict(fit2, newdata=list(age=age.grid), se=T)

se.bands<-cbind(pred2$fit+2*pred2$se.fit, pred2$fit-2*pred2$se.fit)
plot(Wage$age, Wage$wage, col="grey",pch=19)
lines(age.grid, pred2$fit, col="red")
matlines(age.grid, se.bands, col="blue", lty=2)

As with the bs() function, we could instead specify the knots direclty using the knots option.


4. Smoothing splines

Smoothing splines are similar to regression splines, but arise in a slightly different situation. Smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty.

In order to fit a smoothing spline, we use the smooth.spline() function.

fit<-smooth.spline(Wage$age, Wage$wage, df=16)
fit2<-smooth.spline(Wage$age, Wage$wage, cv=TRUE)
fit2$df
## [1] 6.794596
plot(Wage$age, Wage$wage, xlim=agelims, col="grey", pch=19)
lines(fit, col="red")
lines(fit2, col="blue")
legend("topright", legend=c("16 DF", "6.8 DF"), col=c("red","blue"), lty=1)

In the first call of smooth.spline() we specified df=16. The function then determines which value of lambda leads to 16 df. In the second call to smooth.spline() we select the smoothness level by cross-validation; this results in a value of lambda that yields 6.8 df.


5. Local regression

Local regression is similar to splines, but differs in an important way. The regions are allowed to overlap, and indeed they do so in a very smooth way. In order to perform local regression, we use the loess() function.

fit<-loess(wage~age, span=.2, data=Wage)
fit2<-loess(wage~age, span=.5, data=Wage)

plot(Wage$age, Wage$wage, xlim=agelims, col="grey",pch=19)
lines(age.grid, predict(fit, data.frame(age=age.grid)), col="red")
lines(age.grid, predict(fit2, data.frame(age=age.grid)),col="blue")
legend("topright", legend=c("span=0.2", "span=0.5"), col=c("red","blue"), lty=1)

We performed local linear regression using spans of 0.2 and 0.5: that is, each neighborhood consists of 20% or 50% of the observations. The larger the span, the smoother the fit.


6. Generalized additive models (GAM)

GAM allow us to extend the methods above to deal with multiple predictors. We fit a GAM to predict wage using natural spline functions of year and age, treating education as a qualitative predictor using gam() function from gam library.

gam1<-lm(wage~ns(year,4)+ns(age,5)+education, data=Wage)
gam.m3<-gam(wage~s(year,df=4)+s(age,df=5)+education, data=Wage)
par(mfrow=c(1,3))
plot(gam.m3, se=TRUE)

par(mfrow=c(1,3))
plot.Gam(gam1, se=TRUE)

The s() function is used to indicate that we would like to use a smoothing spline. We specify that the function of year should have 4 degrees of freedom, and that the function of age will have 5 degress of freedom. Since education is qualitative, we leave it as is, and it is converted into four dummy variables.

Even though gam1 is not of class gam but rather of class lm, we can still use plot.gam() on it. We have to use plot.gam() rather than the generic plot() function.

The function of year looks rather linear. We can perform a series of ANOVA tests in order to determine which of these three models is best: a GAM that excludes year(m1), a GAM that uses a linear function of year(m2), or a GAM that uses a spline function of year(m3).

gam.m1<-gam(wage~s(age,5)+education, data=Wage)
gam.m2<-gam(wage~year+s(age,5)+education, data=Wage)
gam.m3<-gam(wage~s(year,4)+s(age,5)+education, data=Wage)
anova(gam.m1, gam.m2, gam.m3, test="F")
## Analysis of Deviance Table
## 
## Model 1: wage ~ s(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
##   Resid. Df Resid. Dev Df Deviance       F    Pr(>F)    
## 1      2990    3711731                                  
## 2      2989    3693842  1  17889.2 14.4771 0.0001447 ***
## 3      2986    3689770  3   4071.1  1.0982 0.3485661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We find that there is compelling evidence that a GAM with a linear function of year(m2) is beter than a GAM that does not include year at all(m1) (p-value of 0.00014). However, there is no evidence that a non-linear function of year is needed (p-value=0.349). In other words, based on the results of this ANOVA, M2 is preferred.

summary(gam.m3)
## 
## Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -119.43  -19.70   -3.33   14.17  213.48 
## 
## (Dispersion Parameter for gaussian family taken to be 1235.69)
## 
##     Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29887.75 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## s(year, 4)    1   27162   27162  21.981 2.877e-06 ***
## s(age, 5)     1  195338  195338 158.081 < 2.2e-16 ***
## education     4 1069726  267432 216.423 < 2.2e-16 ***
## Residuals  2986 3689770    1236                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df Npar F  Pr(F)    
## (Intercept)                          
## s(year, 4)        3  1.086 0.3537    
## s(age, 5)         4 32.380 <2e-16 ***
## education                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-values for year and age (from Anova for Nonparametric Effects) correspond to a null hypothesis of a linear relationship versus the alternative of a non-linear relationship. The large p-value for year reinforces our conclusion from the ANOVA test that a linear function is adequate for this term. However, there is very clear evidence that a non-linear term is required for age.

pred<-predict(gam.m2, newdata=Wage)
# We can also use local regression fits as building blocks in a GAM, using the lo() function
gam.lo<-gam(wage~s(year, df=4)+lo(age, span=0.7)+education, data=Wage)
par(mfrow=c(1,3))
plot.Gam(gam.lo, se=TRUE)

## Local regression using lo()
gam.lo.i<-gam(wage~lo(year, age, span=0.5)+education, data=Wage) # fits a two-term model, in which the first term is an interaction between year and age, fit by a local regression surface
par(mfrow=c(1,2))
plot(gam.lo.i)

## Logistic Regression GAM using I()
gam.lr<-gam(I(wage>250)~year+s(age, df=5)+education, family=binomial, data=Wage) 
par(mfrow=c(1,3))
plot(gam.lr, se=T)

table(Wage$education, I(Wage$wage>250)) # We see that there are no high earners in the < HS category
##                     
##                      FALSE TRUE
##   1. < HS Grad         268    0
##   2. HS Grad           966    5
##   3. Some College      643    7
##   4. College Grad      663   22
##   5. Advanced Degree   381   45
gam.lr.s<-gam(I(wage>250)~year+s(age,df=5)+education, family=binomial, data=Wage, 
              subset=(education!="1. < HS Grad")) # Fit a logistic regression GAM using all but education 1. < HS Grad -> This provides more sensible results
par(mfrow=c(1,3))
plot(gam.lr.s, se=T)